Recurrent Neural Networks

Hui Lin @Netlify

Ming Li @Amazon

2019-01-30

Types of Neural Networks

Why sequence models?

Speech recognition: (audio clip) \(\longrightarrow\) "Get your facts first, then you can distort them as you please."
Music generation: \(\emptyset\) \(\longrightarrow\) (music)
Sentiment classification: "Great movie? Are you kidding me! Not worth the money." \(\longrightarrow\) (rating)
DNA sequence analysis: ACGGGGCCTACTGTCAACTG \(\longrightarrow\) AC GGGGCCTACTG TCAACTG
Machine translation: 网红脸 \(\longrightarrow\) Internet celebrity face
Video activity recognition: (video frames) \(\longrightarrow\) Running
Named entity recognition: Use Netlify and Hugo. \(\longrightarrow\) Use [Netlify] and [Hugo].

RNN types

Notation

Representing words

\(\left[\begin{array}{c} a[1]\\ aaron[2]\\ \vdots\\ and[360]\\ \vdots\\ Hugo[4075]\\ \vdots\\ Netlify[5210]\\ \vdots\\ use[8320]\\ \vdots\\ Zulu[10000] \end{array}\right]\Longrightarrow use=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right], Netlify=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], and=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right], Hugo=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ \vdots\\ 0\\ \vdots\\ 0 \end{array}\right]\)
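The one-hot scheme above can be sketched in a few lines of numpy; the vocabulary indices here mirror the slide (shifted to 0-based) and are illustrative only:

```python
import numpy as np

# Toy 10,000-word vocabulary; positions follow the slide, shifted to 0-based.
vocab = {"a": 0, "aaron": 1, "and": 359, "Hugo": 4074,
         "Netlify": 5209, "use": 8319, "Zulu": 9999}
vocab_size = 10000

def one_hot(word):
    """Return the one-hot column vector for `word`: all zeros except a 1 at its index."""
    v = np.zeros(vocab_size)
    v[vocab[word]] = 1.0
    return v

use_vec = one_hot("use")   # a single 1 at position 8319, zeros elsewhere
```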

What is RNN?

Forward Propagation

\(a^{<0>}= \mathbf{0}\); \(a^{<1>} = g(W_{aa}a^{<0>} + W_{ax}x^{<1>} + b_a)\)

\(\hat{y}^{<1>} = g'(W_{ya}a^{<1>} + b_y)\)

\(a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)\)

\(\hat{y}^{<t>} = g'(W_{ya}a^{<t>} + b_y)\)
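A minimal numpy sketch of this forward pass, assuming \(g = \tanh\) and \(g' = \sigma\) (sigmoid); the sizes and random weights are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 4, 3, 2, 5          # hidden, input, output sizes; sequence length

# Parameters (randomly initialized, for illustration only)
W_aa = rng.normal(size=(n_a, n_a)) * 0.1
W_ax = rng.normal(size=(n_a, n_x)) * 0.1
W_ya = rng.normal(size=(n_y, n_a)) * 0.1
b_a = np.zeros(n_a)
b_y = np.zeros(n_y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs):
    """xs: list of T input vectors x^{<t>}. Returns hidden states and predictions."""
    a = np.zeros(n_a)                              # a^{<0>} = 0
    a_list, y_list = [], []
    for x in xs:
        a = np.tanh(W_aa @ a + W_ax @ x + b_a)     # a^{<t>}, with g = tanh
        y_hat = sigmoid(W_ya @ a + b_y)            # y-hat^{<t>}, with g' = sigmoid
        a_list.append(a)
        y_list.append(y_hat)
    return a_list, y_list

xs = [rng.normal(size=n_x) for _ in range(T)]
a_list, y_list = rnn_forward(xs)
```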

Loss Function

\(L^{<t>}(\hat{y}^{<t>}, y^{<t>}) = -y^{<t>}\log(\hat{y}^{<t>}) - (1-y^{<t>})\log(1-\hat{y}^{<t>})\)

\(L(\hat{y}, y) = \sum_{t=1}^{T_y}L^{<t>} (\hat{y}^{<t>}, y^{<t>})\)
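These two formulas translate directly to code: a per-step cross-entropy, summed over time. The numbers below are made up for the example:

```python
import numpy as np

def step_loss(y_hat, y):
    # Cross-entropy at one time step: -y log(y_hat) - (1-y) log(1-y_hat)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def sequence_loss(y_hats, ys):
    # Total loss is the sum of per-step losses over t = 1..T_y
    return sum(step_loss(yh, y) for yh, y in zip(y_hats, ys))

loss = sequence_loss([0.9, 0.2, 0.8], [1, 0, 1])
```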

Backpropagation through time

Deep RNN

Vanishing gradients with RNNs

Gated Recurrent Unit (GRU)

\(a^{<t>}=g(W_a[a^{<t-1>}, x^{<t>}] +b_a)\), where \(g = \tanh\)

GRU

In the sequence below, the memory cell \(c^{<t>}\) stores whether the subject ("cat") is singular; the update gate \(\Gamma_u\) opens only when the cell value should change:

The \(\longleftarrow c^{<t>}=0\), \(\Gamma_u = 0\)

cat \(\longleftarrow c^{<t>}=1\), \(\Gamma_u = 1\)

which \(\longleftarrow c^{<t>}=1\), \(\Gamma_u = 0\)

…

was \(\longleftarrow c^{<t>}=1\), \(\Gamma_u = 0\)
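A toy numeric trace of this behavior, using the GRU update \(c^{<t>} = \Gamma_u * \tilde{c}^{<t>}+(1-\Gamma_u) * c^{<t-1>}\) with hand-picked gate and candidate values:

```python
def gru_memory(c_prev, c_tilde, gamma_u):
    # c^{<t>} = Gamma_u * c_tilde + (1 - Gamma_u) * c^{<t-1>}
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

# (word, candidate c_tilde, update gate Gamma_u) -- values chosen by hand
c = 0.0
for word, c_tilde, gamma_u in [("cat", 1.0, 1.0),    # gate opens: write 1 (singular)
                               ("which", 0.5, 0.0),  # gate closed: candidate ignored
                               ("was", 0.3, 0.0)]:   # memory still intact at "was"
    c = gru_memory(c, c_tilde, gamma_u)
```

Because \(\Gamma_u = 0\) after "cat", the cell carries the value 1 unchanged across the intervening words, which is exactly how the GRU fights vanishing gradients over long dependencies.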

Full GRU

\(c^{<t>}=a^{<t>}\)

\(\tilde{c}^{<t>}=tanh(W_{c}[c^{<t-1>},x^{<t>}]+b_{c})\)

\(\Gamma_{u}=\sigma(W_{u}[c^{<t-1>},x^{<t>}]+b_{u})\)


\(c^{<t>} = \Gamma_u * \tilde{c}^{<t>}+(1-\Gamma_u) * c^{<t-1>}\)

Full GRU

\(c^{<t>}=a^{<t>}\)

\(\tilde{c}^{<t>}=tanh(W_{c}[\Gamma_r* c^{<t-1>},x^{<t>}]+b_{c})\)

\(\Gamma_{u}=\sigma(W_{u}[c^{<t-1>},x^{<t>}]+b_{u})\)

\(\Gamma_{r}=\sigma(W_{r}[c^{<t-1>},x^{<t>}]+b_{r})\)

\(c^{<t>} = \Gamma_u * \tilde{c}^{<t>}+(1-\Gamma_u) * c^{<t-1>}\)
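The full GRU step above, sketched in numpy; dimensions and weights are toy placeholders, and \([\cdot,\cdot]\) is vector concatenation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_c, n_x = 4, 3                       # cell and input sizes (toy values)

# Candidate and gate parameters (random, for illustration only)
W_c = rng.normal(size=(n_c, n_c + n_x)) * 0.1
W_u = rng.normal(size=(n_c, n_c + n_x)) * 0.1
W_r = rng.normal(size=(n_c, n_c + n_x)) * 0.1
b_c, b_u, b_r = np.zeros(n_c), np.zeros(n_c), np.zeros(n_c)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x):
    concat = np.concatenate([c_prev, x])                  # [c^{<t-1>}, x^{<t>}]
    gamma_r = sigmoid(W_r @ concat + b_r)                 # relevance gate
    c_tilde = np.tanh(W_c @ np.concatenate([gamma_r * c_prev, x]) + b_c)
    gamma_u = sigmoid(W_u @ concat + b_u)                 # update gate
    c = gamma_u * c_tilde + (1 - gamma_u) * c_prev        # gated convex combination
    return c                                              # a^{<t>} = c^{<t>}

c = np.zeros(n_c)
for _ in range(3):
    c = gru_step(c, rng.normal(size=n_x))
```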

Long Short-Term Memory (LSTM)

\[\begin{array}{cc} GRU & LSTM\\ \tilde{c}^{<t>}=\tanh(W_{c}[\Gamma_{r}*c^{<t-1>},x^{<t>}]+b_{c}) & \tilde{c}^{<t>}=\tanh(W_{c}[a^{<t-1>},x^{<t>}]+b_{c})\\ \Gamma_{u}=\sigma(W_{u}[c^{<t-1>},x^{<t>}]+b_{u}) & \Gamma_{u}=\sigma(W_{u}[a^{<t-1>},x^{<t>}]+b_{u})\\ \Gamma_{r}=\sigma(W_{r}[c^{<t-1>},x^{<t>}]+b_{r}) & \Gamma_{f}=\sigma(W_{f}[a^{<t-1>},x^{<t>}]+b_{f})\\ & \Gamma_{o}=\sigma(W_{o}[a^{<t-1>},x^{<t>}]+b_{o})\\ c^{<t>}=\Gamma_{u}*\tilde{c}^{<t>}+(1-\Gamma_{u})*c^{<t-1>} & c^{<t>}=\Gamma_{u}*\tilde{c}^{<t>}+\Gamma_{f}*c^{<t-1>}\\ a^{<t>}=c^{<t>} & a^{<t>}=\Gamma_{o}*c^{<t>} \end{array}\]
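The LSTM column translates to code much like the GRU, with separate update, forget, and output gates. A toy numpy sketch (random weights, illustrative sizes), following the equations as written here, i.e. \(a^{<t>}=\Gamma_{o}*c^{<t>}\) without an extra \(\tanh\):

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_x = 4, 3                       # toy hidden and input sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Candidate and gate parameters (random placeholders)
W_c, W_u, W_f, W_o = [rng.normal(size=(n_a, n_a + n_x)) * 0.1 for _ in range(4)]
b_c = b_u = b_f = b_o = np.zeros(n_a)

def lstm_step(a_prev, c_prev, x):
    z = np.concatenate([a_prev, x])          # [a^{<t-1>}, x^{<t>}]
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate cell value
    gamma_u = sigmoid(W_u @ z + b_u)         # update gate
    gamma_f = sigmoid(W_f @ z + b_f)         # forget gate (replaces GRU's 1 - Gamma_u)
    gamma_o = sigmoid(W_o @ z + b_o)         # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev
    a = gamma_o * c                          # as on this slide; some variants use tanh(c)
    return a, c

a, c = lstm_step(np.zeros(n_a), np.zeros(n_a), np.ones(n_x))
```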

LSTM

Word representation

\[\begin{array}{cccccc} Man & Woman & King & Queen & Apple & Pumpkin\\ (5391) & (9853) & (4914) & (7157) & (456) & (6332)\\ \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ \vdots\\ 1\\ \vdots\\ 0\\ 0\\ 0\\ 0\\ 0 \end{array}\right] & \left[\begin{array}{c} 0\\ 0\\ 0\\ 0\\ 0\\ \vdots\\ 1\\ \vdots\\ 0 \end{array}\right] \end{array}\]

Featurized representation: word embedding

Analogies

\(e_{man} - e_{woman} \approx e_{king} - e_{?}\)

\(\rightarrow \underset{w}{argmax} \{sim (e_{w}, e_{king} - e_{man} + e_{woman})\}\)

Cosine similarity

\(sim(e_w, e_{king}-e_{man}+e_{woman})\) = ?

Cosine similarity: \(sim(a,b) = \frac{a^{T}b}{ ||a||_{2} ||b||_{2}}\)
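Putting the similarity and the argmax together; the 2-d vectors below are hypothetical embeddings picked by hand so the analogy resolves, not learned values:

```python
import numpy as np

def cos_sim(a, b):
    # sim(a, b) = a^T b / (||a||_2 ||b||_2)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 2-d embeddings, hand-picked for illustration
E = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.1]),
    "queen": np.array([3.0, 1.1]),
    "apple": np.array([-2.0, 0.3]),
}

target = E["king"] - E["man"] + E["woman"]   # e_king - e_man + e_woman
# argmax_w sim(e_w, target), excluding the query words themselves
best = max((w for w in E if w not in {"king", "man", "woman"}),
           key=lambda w: cos_sim(E[w], target))
```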

Embedding matrix
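The embedding matrix \(E\) (embedding dimension \(\times\) vocabulary size) maps a one-hot vector \(o_j\) to its embedding column: \(E \, o_j = e_j\). A toy sketch with made-up sizes:

```python
import numpy as np

n_embed, vocab_size = 5, 8                   # toy sizes (real: e.g. 300 x 10,000)
rng = np.random.default_rng(3)
E = rng.normal(size=(n_embed, vocab_size))   # embedding matrix (random placeholder)

j = 2
o_j = np.zeros(vocab_size)
o_j[j] = 1.0                                 # one-hot vector for word j

e_j = E @ o_j                                # the matrix product selects column j...
```

In practice no multiplication is done; one simply indexes the column, `E[:, j]`, which gives the identical vector far more cheaply.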

Transfer learning

  1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.

  2. Transfer the embeddings to a new task with a smaller training set (say, 100k words).

  3. Optional: continue to fine-tune the word embeddings with the new data.
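Steps 1-2 amount to building the task's embedding matrix from the pre-trained vectors. In the sketch below, `pretrained` stands in for a downloaded embedding file (e.g. GloVe); the vectors are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for pre-trained embeddings loaded from disk (vectors are placeholders)
pretrained = {w: rng.normal(size=50) for w in ["use", "netlify", "and", "hugo"]}

task_vocab = ["use", "netlify", "and", "hugo", "unseenword"]
E = np.zeros((len(task_vocab), 50))          # task embedding matrix, one row per word
for i, w in enumerate(task_vocab):
    if w in pretrained:
        E[i] = pretrained[w]                 # step 2: transfer the pre-trained vector
    # else: leave zeros (or random init); step 3 would fine-tune these rows
```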